{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)\n",
    "\n",
    "# 34 - Distributing Calls Using Ray\n",
    "\n",
    "AWS SDK for pandas supports distribution of specific calls using [ray](https://docs.ray.io/) and [modin](https://modin.readthedocs.io/en/stable/).\n",
    "\n",
    "When enabled, data loading methods return modin dataframes instead of pandas dataframes. Modin provides seamless integration and compatibility with existing pandas code, with the benefit of distributing operations across your Ray instance and operating at a much larger scale."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "!pip install \"awswrangler[modin,ray,redshift]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Importing `awswrangler` when `ray` and `modin` are installed will automatically initialize a local Ray instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Execution Engine: EngineEnum.RAY\n",
      "Memory Format: MemoryFormatEnum.MODIN\n"
     ]
    }
   ],
   "source": [
    "import awswrangler as wr\n",
    "\n",
    "print(f\"Execution Engine: {wr.engine.get()}\")\n",
    "print(f\"Memory Format: {wr.memory_format.get()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### Read data at scale\n",
    "\n",
    "Data is read using all cores on a single machine or multiple nodes on a cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-09-15 12:24:44,457\tINFO worker.py:1621 -- Started a local Ray instance.\n",
      "2023-09-15 12:25:10,728\tINFO read_api.py:374 -- To satisfy the requested parallelism of 200, each read task output will be split into 34 smaller blocks.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[dataset]: Run `pip install tqdm` to enable progress reporting.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1\n"
     ]
    },
    {
     "data": {
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>vendor_id</th>\n      <th>pickup_at</th>\n      <th>dropoff_at</th>\n      <th>passenger_count</th>\n      <th>trip_distance</th>\n      <th>rate_code_id</th>\n      <th>store_and_fwd_flag</th>\n      <th>pickup_location_id</th>\n      <th>dropoff_location_id</th>\n      <th>payment_type</th>\n      <th>fare_amount</th>\n      <th>extra</th>\n      <th>mta_tax</th>\n      <th>tip_amount</th>\n      <th>tolls_amount</th>\n      <th>improvement_surcharge</th>\n      <th>total_amount</th>\n      <th>congestion_surcharge</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>2019-01-01 00:46:40</td>\n      <td>2019-01-01 00:53:20</td>\n      <td>1</td>\n      <td>1.5</td>\n      <td>1</td>\n      <td>N</td>\n      <td>151</td>\n      <td>239</td>\n      <td>1</td>\n      <td>7.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>1.65</td>\n      <td>0.0</td>\n      <td>0.3</td>\n      <td>9.950000</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>1</td>\n      <td>2019-01-01 00:59:47</td>\n      <td>2019-01-01 01:18:59</td>\n      <td>1</td>\n      <td>2.6</td>\n      <td>1</td>\n      <td>N</td>\n      <td>239</td>\n      <td>246</td>\n      <td>1</td>\n      <td>14.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>1.00</td>\n      <td>0.0</td>\n      <td>0.3</td>\n      <td>16.299999</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2</td>\n      <td>2018-12-21 13:48:30</td>\n      <td>2018-12-21 13:52:40</td>\n      <td>3</td>\n      <td>0.0</td>\n      <td>1</td>\n      <td>N</td>\n      <td>236</td>\n      <td>236</td>\n      <td>1</td>\n      <td>4.5</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>0.0</td>\n      <td>0.3</td>\n      <td>5.800000</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>2</td>\n      <td>2018-11-28 15:52:25</td>\n      <td>2018-11-28 15:55:45</td>\n      <td>5</td>\n      <td>0.0</td>\n      <td>1</td>\n      <td>N</td>\n      <td>193</td>\n      <td>193</td>\n      <td>2</td>\n      <td>3.5</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>0.0</td>\n      <td>0.3</td>\n      <td>7.550000</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>2</td>\n      <td>2018-11-28 15:56:57</td>\n      <td>2018-11-28 15:58:33</td>\n      <td>5</td>\n      <td>0.0</td>\n      <td>2</td>\n      <td>N</td>\n      <td>193</td>\n      <td>193</td>\n      <td>2</td>\n      <td>52.0</td>\n      <td>0.0</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>0.0</td>\n      <td>0.3</td>\n      <td>55.549999</td>\n      <td>NaN</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
      "text/plain": "  vendor_id           pickup_at          dropoff_at  passenger_count  \\\n0         1 2019-01-01 00:46:40 2019-01-01 00:53:20                1   \n1         1 2019-01-01 00:59:47 2019-01-01 01:18:59                1   \n2         2 2018-12-21 13:48:30 2018-12-21 13:52:40                3   \n3         2 2018-11-28 15:52:25 2018-11-28 15:55:45                5   \n4         2 2018-11-28 15:56:57 2018-11-28 15:58:33                5   \n\n   trip_distance rate_code_id store_and_fwd_flag  pickup_location_id  \\\n0            1.5            1                  N                 151   \n1            2.6            1                  N                 239   \n2            0.0            1                  N                 236   \n3            0.0            1                  N                 193   \n4            0.0            2                  N                 193   \n\n   dropoff_location_id payment_type  fare_amount  extra  mta_tax  tip_amount  \\\n0                  239            1          7.0    0.5      0.5        1.65   \n1                  246            1         14.0    0.5      0.5        1.00   \n2                  236            1          4.5    0.5      0.5        0.00   \n3                  193            2          3.5    0.5      0.5        0.00   \n4                  193            2         52.0    0.0      0.5        0.00   \n\n   tolls_amount  improvement_surcharge  total_amount  congestion_surcharge  \n0           0.0                    0.3      9.950000                   NaN  \n1           0.0                    0.3     16.299999                   NaN  \n2           0.0                    0.3      5.800000                   NaN  \n3           0.0                    0.3      7.550000                   NaN  \n4           0.0                    0.3     55.549999                   NaN  "
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = wr.s3.read_parquet(path=\"s3://ursa-labs-taxi-data/2019/\")\n",
    "df.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "The data type is a modin DataFrame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "modin.pandas.dataframe.DataFrame"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "However, this type is interoperable with standard pandas calls:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "filtered_df = df[df.trip_distance > 30]\n",
    "excluded_columns = [\"vendor_id\", \"passenger_count\", \"store_and_fwd_flag\"]\n",
    "filtered_df = filtered_df.loc[:, ~filtered_df.columns.isin(excluded_columns)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "Enter your bucket name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "bucket = \"BUCKET\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### Write data at scale\n",
    "\n",
    "The write operation is parallelized, leading to significant speed-ups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data has been written to 408 files\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-09-15 12:32:28,917\tWARNING plan.py:567 -- Warning: The Ray cluster currently does not have any available CPUs. The Dataset job will hang unless more CPUs are freed up. A common reason is that cluster resources are used by Actors or Tune trials; see the following link for more details: https://docs.ray.io/en/master/data/dataset-internals.html#datasets-and-tune\n",
      "2023-09-15 12:32:31,094\tINFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]\n",
      "2023-09-15 12:32:31,095\tINFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\n",
      "2023-09-15 12:32:31,096\tINFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Data has been written to 408 files\n"
     ]
    }
   ],
   "source": [
    "result = wr.s3.to_parquet(\n",
    "    filtered_df,\n",
    "    path=f\"s3://{bucket}/taxi/\",\n",
    "    dataset=True,\n",
    ")\n",
    "print(f\"Data has been written to {len(result['paths'])} files\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### Copy to Redshift at scale...\n",
    "\n",
    "Data is first staged in S3 then a [COPY](https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html) command is executed against the Redshift cluster to load it. Both operations are distributed: S3 write with Ray and COPY in the Redshift cluster"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-09-15 12:52:24,155\tINFO streaming_executor.py:92 -- Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[Write]\n",
      "2023-09-15 12:52:24,157\tINFO streaming_executor.py:93 -- Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), locality_with_output=False, preserve_order=False, actor_locality_enabled=True, verbose_progress=False)\n",
      "2023-09-15 12:52:24,157\tINFO streaming_executor.py:95 -- Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`\n"
     ]
    }
   ],
   "source": [
    "# Connect to the Redshift instance\n",
    "con = wr.redshift.connect(\"aws-sdk-pandas-redshift\")\n",
    "\n",
    "path = f\"s3://{bucket}/stage/\"\n",
    "iam_role = \"ROLE\"\n",
    "schema = \"public\"\n",
    "table = \"taxi\"\n",
    "\n",
    "wr.redshift.copy(\n",
    "    df=filtered_df,\n",
    "    path=path,\n",
    "    con=con,\n",
    "    schema=schema,\n",
    "    table=table,\n",
    "    mode=\"overwrite\",\n",
    "    iam_role=iam_role,\n",
    "    max_rows_by_file=None,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### ... and UNLOAD it back\n",
    "\n",
    "Parallel calls can also be leveraged when reading from the cluster. The [UNLOAD](https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html) command distributes query processing in Redshift to dump files in S3 which are then read in parallel into a dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2023-09-15 12:56:53,838\tINFO read_api.py:374 -- To satisfy the requested parallelism of 16, each read task output will be split into 8 smaller blocks.\n"
     ]
    },
    {
     "data": {
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>pickup_at</th>\n      <th>dropoff_at</th>\n      <th>trip_distance</th>\n      <th>rate_code_id</th>\n      <th>pickup_location_id</th>\n      <th>dropoff_location_id</th>\n      <th>payment_type</th>\n      <th>fare_amount</th>\n      <th>extra</th>\n      <th>mta_tax</th>\n      <th>tip_amount</th>\n      <th>tolls_amount</th>\n      <th>improvement_surcharge</th>\n      <th>total_amount</th>\n      <th>congestion_surcharge</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>2019-01-22 17:40:04</td>\n      <td>2019-01-22 18:33:48</td>\n      <td>30.469999</td>\n      <td>4</td>\n      <td>132</td>\n      <td>265</td>\n      <td>1</td>\n      <td>142.000000</td>\n      <td>1.0</td>\n      <td>0.5</td>\n      <td>28.760000</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>172.559998</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2019-01-22 18:36:34</td>\n      <td>2019-01-22 19:52:50</td>\n      <td>33.330002</td>\n      <td>5</td>\n      <td>51</td>\n      <td>221</td>\n      <td>1</td>\n      <td>96.019997</td>\n      <td>0.0</td>\n      <td>0.5</td>\n      <td>0.000000</td>\n      <td>11.52</td>\n      <td>0.3</td>\n      <td>108.339996</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>2019-01-22 19:11:08</td>\n      <td>2019-01-22 20:16:10</td>\n      <td>32.599998</td>\n      <td>1</td>\n      <td>231</td>\n      <td>205</td>\n      <td>1</td>\n      <td>88.000000</td>\n      <td>1.0</td>\n      <td>0.5</td>\n      <td>0.000000</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>89.800003</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>2019-01-22 19:14:15</td>\n      <td>2019-01-22 20:09:57</td>\n      <td>36.220001</td>\n      <td>4</td>\n      <td>132</td>\n      <td>265</td>\n      <td>1</td>\n      <td>130.500000</td>\n      <td>1.0</td>\n      <td>0.5</td>\n      <td>27.610001</td>\n      <td>5.76</td>\n      <td>0.3</td>\n      <td>165.669998</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>2019-01-22 19:51:56</td>\n      <td>2019-01-22 20:48:39</td>\n      <td>33.040001</td>\n      <td>5</td>\n      <td>132</td>\n      <td>265</td>\n      <td>1</td>\n      <td>130.000000</td>\n      <td>0.0</td>\n      <td>0.5</td>\n      <td>29.410000</td>\n      <td>16.26</td>\n      <td>0.3</td>\n      <td>176.470001</td>\n      <td>0.0</td>\n    </tr>\n  </tbody>\n</table>\n</div>",
      "text/plain": "            pickup_at          dropoff_at  trip_distance rate_code_id  \\\n0 2019-01-22 17:40:04 2019-01-22 18:33:48      30.469999            4   \n1 2019-01-22 18:36:34 2019-01-22 19:52:50      33.330002            5   \n2 2019-01-22 19:11:08 2019-01-22 20:16:10      32.599998            1   \n3 2019-01-22 19:14:15 2019-01-22 20:09:57      36.220001            4   \n4 2019-01-22 19:51:56 2019-01-22 20:48:39      33.040001            5   \n\n   pickup_location_id  dropoff_location_id payment_type  fare_amount  extra  \\\n0                 132                  265            1   142.000000    1.0   \n1                  51                  221            1    96.019997    0.0   \n2                 231                  205            1    88.000000    1.0   \n3                 132                  265            1   130.500000    1.0   \n4                 132                  265            1   130.000000    0.0   \n\n   mta_tax  tip_amount  tolls_amount  improvement_surcharge  total_amount  \\\n0      0.5   28.760000          0.00                    0.3    172.559998   \n1      0.5    0.000000         11.52                    0.3    108.339996   \n2      0.5    0.000000          0.00                    0.3     89.800003   \n3      0.5   27.610001          5.76                    0.3    165.669998   \n4      0.5   29.410000         16.26                    0.3    176.470001   \n\n   congestion_surcharge  \n0                   0.0  \n1                   0.0  \n2                   0.0  \n3                   0.0  \n4                   0.0  "
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = wr.redshift.unload(\n",
    "    sql=f\"SELECT * FROM {schema}.{table} where trip_distance > 30\",\n",
    "    con=con,\n",
    "    iam_role=iam_role,\n",
    "    path=path,\n",
    "    keep_files=True,\n",
    ")\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "#### Find a needle in a hay stack with S3 Select"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>vendor_id</th>\n      <th>pickup_at</th>\n      <th>dropoff_at</th>\n      <th>passenger_count</th>\n      <th>trip_distance</th>\n      <th>rate_code_id</th>\n      <th>store_and_fwd_flag</th>\n      <th>pickup_location_id</th>\n      <th>dropoff_location_id</th>\n      <th>payment_type</th>\n      <th>fare_amount</th>\n      <th>extra</th>\n      <th>mta_tax</th>\n      <th>tip_amount</th>\n      <th>tolls_amount</th>\n      <th>improvement_surcharge</th>\n      <th>total_amount</th>\n      <th>congestion_surcharge</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>1</td>\n      <td>2019-01-01T00:19:55.000Z</td>\n      <td>2019-01-01T00:57:56.000Z</td>\n      <td>1</td>\n      <td>12.30</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>50</td>\n      <td>1</td>\n      <td>38.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>4.00</td>\n      <td>5.76</td>\n      <td>0.3</td>\n      <td>49.060001</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>2</td>\n      <td>2019-01-01T00:48:10.000Z</td>\n      <td>2019-01-01T01:36:58.000Z</td>\n      <td>1</td>\n      <td>31.57</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>138</td>\n      <td>2</td>\n      <td>82.5</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>83.800003</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1</td>\n      <td>2019-01-01T00:39:58.000Z</td>\n      <td>2019-01-01T00:58:58.000Z</td>\n      <td>2</td>\n      <td>8.90</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>224</td>\n      <td>1</td>\n      <td>26.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>8.25</td>\n      <td>5.76</td>\n      <td>0.3</td>\n      <td>41.310001</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>1</td>\n      <td>2019-01-01T00:07:45.000Z</td>\n      <td>2019-01-01T00:34:12.000Z</td>\n      <td>4</td>\n      <td>9.60</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>239</td>\n      <td>1</td>\n      <td>29.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>7.20</td>\n      <td>5.76</td>\n      <td>0.3</td>\n      <td>43.259998</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>2</td>\n      <td>2019-01-01T00:27:40.000Z</td>\n      <td>2019-01-01T00:52:15.000Z</td>\n      <td>1</td>\n      <td>12.89</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>87</td>\n      <td>2</td>\n      <td>36.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>37.299999</td>\n      <td>NaN</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>1167508</th>\n      <td>2</td>\n      <td>2019-06-30T23:42:24.000Z</td>\n      <td>2019-07-01T00:10:28.000Z</td>\n      <td>1</td>\n      <td>15.66</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>265</td>\n      <td>2</td>\n      <td>44.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>0.00</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>45.299999</td>\n      <td>0.0</td>\n    </tr>\n    <tr>\n      <th>1167509</th>\n      <td>2</td>\n      <td>2019-06-30T23:07:34.000Z</td>\n      <td>2019-06-30T23:25:09.000Z</td>\n      <td>1</td>\n      <td>7.38</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>262</td>\n      <td>1</td>\n      <td>22.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>7.98</td>\n      <td>6.12</td>\n      <td>0.3</td>\n      <td>39.900002</td>\n      <td>2.5</td>\n    </tr>\n    <tr>\n      <th>1167510</th>\n      <td>2</td>\n      <td>2019-06-30T23:00:36.000Z</td>\n      <td>2019-06-30T23:20:18.000Z</td>\n      <td>1</td>\n      <td>11.24</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>107</td>\n      <td>1</td>\n      <td>31.0</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>8.18</td>\n      <td>6.12</td>\n      <td>0.3</td>\n      <td>49.099998</td>\n      <td>2.5</td>\n    </tr>\n    <tr>\n      <th>1167511</th>\n      <td>1</td>\n      <td>2019-06-30T23:08:06.000Z</td>\n      <td>2019-06-30T23:30:20.000Z</td>\n      <td>1</td>\n      <td>7.50</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>229</td>\n      <td>1</td>\n      <td>24.0</td>\n      <td>3.0</td>\n      <td>0.5</td>\n      <td>4.00</td>\n      <td>0.00</td>\n      <td>0.3</td>\n      <td>31.799999</td>\n      <td>2.5</td>\n    </tr>\n    <tr>\n      <th>1167512</th>\n      <td>2</td>\n      <td>2019-06-30T23:15:13.000Z</td>\n      <td>2019-06-30T23:35:18.000Z</td>\n      <td>2</td>\n      <td>8.73</td>\n      <td>1</td>\n      <td>N</td>\n      <td>138</td>\n      <td>262</td>\n      <td>1</td>\n      <td>25.5</td>\n      <td>0.5</td>\n      <td>0.5</td>\n      <td>1.77</td>\n      <td>6.12</td>\n      <td>0.3</td>\n      <td>37.189999</td>\n      <td>2.5</td>\n    </tr>\n  </tbody>\n</table>\n<p>1167513 rows × 18 columns</p>\n</div>",
      "text/plain": "        vendor_id                 pickup_at                dropoff_at  \\\n0               1  2019-01-01T00:19:55.000Z  2019-01-01T00:57:56.000Z   \n1               2  2019-01-01T00:48:10.000Z  2019-01-01T01:36:58.000Z   \n2               1  2019-01-01T00:39:58.000Z  2019-01-01T00:58:58.000Z   \n3               1  2019-01-01T00:07:45.000Z  2019-01-01T00:34:12.000Z   \n4               2  2019-01-01T00:27:40.000Z  2019-01-01T00:52:15.000Z   \n...           ...                       ...                       ...   \n1167508         2  2019-06-30T23:42:24.000Z  2019-07-01T00:10:28.000Z   \n1167509         2  2019-06-30T23:07:34.000Z  2019-06-30T23:25:09.000Z   \n1167510         2  2019-06-30T23:00:36.000Z  2019-06-30T23:20:18.000Z   \n1167511         1  2019-06-30T23:08:06.000Z  2019-06-30T23:30:20.000Z   \n1167512         2  2019-06-30T23:15:13.000Z  2019-06-30T23:35:18.000Z   \n\n         passenger_count  trip_distance rate_code_id store_and_fwd_flag  \\\n0                      1          12.30            1                  N   \n1                      1          31.57            1                  N   \n2                      2           8.90            1                  N   \n3                      4           9.60            1                  N   \n4                      1          12.89            1                  N   \n...                  ...            ...          ...                ...   \n1167508                1          15.66            1                  N   \n1167509                1           7.38            1                  N   \n1167510                1          11.24            1                  N   \n1167511                1           7.50            1                  N   \n1167512                2           8.73            1                  N   \n\n         pickup_location_id  dropoff_location_id payment_type  fare_amount  \\\n0                       138                   50            1         38.0   \n1                       138                  138            2         82.5   \n2                       138                  224            1         26.0   \n3                       138                  239            1         29.0   \n4                       138                   87            2         36.0   \n...                     ...                  ...          ...          ...   \n1167508                 138                  265            2         44.0   \n1167509                 138                  262            1         22.0   \n1167510                 138                  107            1         31.0   \n1167511                 138                  229            1         24.0   \n1167512                 138                  262            1         25.5   \n\n         extra  mta_tax  tip_amount  tolls_amount  improvement_surcharge  \\\n0          0.5      0.5        4.00          5.76                    0.3   \n1          0.5      0.5        0.00          0.00                    0.3   \n2          0.5      0.5        8.25          5.76                    0.3   \n3          0.5      0.5        7.20          5.76                    0.3   \n4          0.5      0.5        0.00          0.00                    0.3   \n...        ...      ...         ...           ...                    ...   \n1167508    0.5      0.5        0.00          0.00                    0.3   \n1167509    0.5      0.5        7.98          6.12                    0.3   \n1167510    0.5      0.5        8.18          6.12                    0.3   \n1167511    3.0      0.5        4.00          0.00                    0.3   \n1167512    0.5      0.5        1.77          6.12                    0.3   \n\n         total_amount  congestion_surcharge  \n0           49.060001                   NaN  \n1           83.800003                   NaN  \n2           41.310001                   NaN  \n3           43.259998                   NaN  \n4           37.299999                   NaN  \n...               ...                   ...  \n1167508     45.299999                   0.0  \n1167509     39.900002                   2.5  \n1167510     49.099998                   2.5  \n1167511     31.799999                   2.5  \n1167512     37.189999                   2.5  \n\n[1167513 rows x 18 columns]"
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import awswrangler as wr\n",
    "\n",
    "# Run S3 Select query against all objects for 2019 year to find trips starting from a particular location\n",
    "wr.s3.select_query(\n",
    "    sql='SELECT * FROM s3object s where s.\"pickup_location_id\" = 138',\n",
    "    path=\"s3://ursa-labs-taxi-data/2019/\",\n",
    "    input_serialization=\"Parquet\",\n",
    "    input_serialization_params={},\n",
    "    scan_range_chunk_size=32 * 1024 * 1024,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "awswrangler-v9JnknIF-py3.8",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "vscode": {
   "interpreter": {
    "hash": "83297b058d59ee0acd247586c837429190a8258f15c0eea6234359f5557dde51"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}